To start using the dplyr package from the tidyverse to select columns and filter data.
In this workshop, the aim is to cover how to start working with the key library from the tidyverse, dplyr. We will be covering:
image credit: Analytics Vidhya
The tidyverse is a collection of R packages that are designed for
data science. These packages share design, syntax, and philosophy. These
packages cover the import of data (readr and
haven), manipulation and transformation of data
(dplyr, tidyr, stringr,
purrr, forcats, and lubridate),
visualisation (ggplot2 and its extensions), and analysis
(tidymodels).
Essentially, the tidyverse makes data science in R less painful, improving your experience of R and data science, especially in the data cleaning and wrangling stages.
The tidyverse has a focus on working with tidy data, or making data tidy, ready for visualisation and analysis. So what does tidy data mean?
When your data is tidy, each column is a variable, each row is an observation, and each cell is a single value, as per our example below:
# tidy data example
tidy_df <- data.frame(
id = 1:6,
name = c("floof", "max", "cat", "donut", "merlin", "panda"),
colour = c("grey", "black", "orange", "grey", "black", "calico")
)
tidy_df
## id name colour
## 1 1 floof grey
## 2 2 max black
## 3 3 cat orange
## 4 4 donut grey
## 5 5 merlin black
## 6 6 panda calico
Messy data is inconsistent, and every messy dataset is messy in its own way, which makes it harder for you and for others to work with. The example below is a messy dataset: we would have to split up the animal column into name and colour. In later workshops, we will cover how to deal with messy data.
# example messy data frame
messy_df <- data.frame(
id = c(1,1,2,2,3,3,4,4,5,5,6,6),
animal = c("floof", "grey",
"max", "black",
"cat", "orange",
"donut", "grey",
"merlin", "black",
"panda", "calico")
)
messy_df
## id animal
## 1 1 floof
## 2 1 grey
## 3 2 max
## 4 2 black
## 5 3 cat
## 6 3 orange
## 7 4 donut
## 8 4 grey
## 9 5 merlin
## 10 5 black
## 11 6 panda
## 12 6 calico
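As a preview, and only as a minimal sketch: because this particular messy_df alternates a name row and a colour row for each id, base R indexing is enough to rebuild the tidy layout. The odd/even row assumption and the name tidied_df are ours for illustration, and would not hold for messy data in general.

```r
# rebuild the messy example so this chunk runs on its own
messy_df <- data.frame(
  id = c(1,1,2,2,3,3,4,4,5,5,6,6),
  animal = c("floof", "grey", "max", "black", "cat", "orange",
             "donut", "grey", "merlin", "black", "panda", "calico")
)

# assumption: odd rows hold the name, even rows hold the colour
name_rows <- seq(1, nrow(messy_df), by = 2)
colour_rows <- seq(2, nrow(messy_df), by = 2)

# one row per animal, one column per variable
tidied_df <- data.frame(
  id = messy_df$id[name_rows],
  name = messy_df$animal[name_rows],
  colour = messy_df$animal[colour_rows]
)
tidied_df
```

The result has the same shape as tidy_df above: six rows, with id, name, and colour each in their own column.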
Image credit: Julie Lowndes and Allison Horst
See this excellent article, which has lots of nice images, for a summary: https://www.openscapes.org/blog/2020/10/12/tidy-data/
In this workshop we will be using three packages: magrittr, dplyr, and readr.
Using the code chunk below, install all three of these packages. Note
that dplyr is large and might take a minute or so to install; we have
added the Ncpus = 6 argument, which should speed things up a
bit.
# your code here
install.packages("", Ncpus = 6)
install.packages("", Ncpus = 6)
install.packages("", Ncpus = 6)
Also note that you can install the whole tidyverse with install.packages("tidyverse")! This takes a while though, so for this workshop we will just install individual packages.
The pipe operator in R comes from the magrittr package,
using syntax of %>%.
The pipe operator is for chaining a sequence of operations together. This has two main advantages: it makes your code more readable, and it saves some typing.
The syntax is data %>% function, as shown in the
example below. The data gets piped into the function.
library(magrittr)
data <- c(4.1 ,1.7, 1.1, 7.5, 1.7)
data %>% mean()
## [1] 3.22
To see the difference between using pipes and not using pipes, look at the following examples.
We are going to calculate a mean of a vector of numbers, round the result, and print it using paste.
# Make some data: 20 randomly selected data points, from 1 to 10
x <- sample(1:10, 20, replace = TRUE)
y <- sample(1:10, 20, replace = TRUE)
# without pipe
y_mean <- mean(y)
y_mean <- round(y_mean, digits = 2)
y_mean <- paste("Mean value of y is", y_mean)
y_mean
## [1] "Mean value of y is 5.6"
# without pipe in one line
paste("Mean value of y is", round(mean(y), digits = 2))
## [1] "Mean value of y is 5.6"
Now let’s have a look at how to do this same set of operations with pipes. The process is as follows: assign the result to x_mean, pipe x to the mean function, pipe the result of mean to round, and finally pipe the rounded result to paste.
You will notice in the paste function we have used a .
after the text. This is called a place-holder, whereby instead
of using the data (like we did above without the pipe) we add a
. to tell R that is where we want our data to go.
# load in magrittr
library(magrittr)
# magrittr pipe
x_mean <- x %>% # assign result at the start
mean() %>%
round(digits = 2) %>%
paste("Mean value of x is", .) # we use the . as a place holder for a variable (e.g. instead of x)
x_mean
## [1] "Mean value of x is 5.5"
Notice how we assign the result at the start just like we would usually do, then pipe from then on.
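To make the place-holder behaviour concrete, here is a small sketch comparing a pipe with and without the `.`. The label variable and the value 5.5 are made up for illustration.

```r
library(magrittr)

label <- "is the mean"

# no place-holder: the piped value becomes the first argument
first_arg <- label %>% paste(5.5)    # same as paste(label, 5.5)

# with the . place-holder: the piped value goes where the . is
dot_arg <- label %>% paste(5.5, .)   # same as paste(5.5, label)

first_arg
dot_arg
```

Without the place-holder the text comes out in the wrong order, which is why the paste example above needs the `.`.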
It is also worth mentioning that as of version 4.1 of R, base R comes
with a native pipe operator. It is relatively new, and may get
more use in examples you’ll see online in the future. The syntax uses
|> as the pipe, and the structure is the same as a
magrittr pipe.
note: in R 4.1 the native pipe doesn’t have a place-holder (R 4.2 added _, which must be passed to a named argument), so we won’t use paste in this example
# native R pipe
z <- sample(1:10, 20, replace = TRUE)
z_mean <- z |>
mean() |>
round(digits = 2)
z_mean
## [1] 6.65
If the above example doesn’t work, it means you have a version of R that is less than 4.1. Run the below code chunk to test out your R version. If it is less than 4.1 you can update it after the workshop.
# test your r version
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"
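If you do want to finish a native pipe chain with paste, one possible workaround in R 4.1 or later is to pipe into an anonymous function, which lets you choose where the value goes. This is a sketch only; the vector z and the argument name m are made up for illustration.

```r
# requires R 4.1 or later (native pipe and \(x) function shorthand)
z <- c(1, 2, 4)

z_label <- z |>
  mean() |>
  round(digits = 2) |>
  (\(m) paste("Mean value of z is", m))()  # m receives the piped value

z_label
```

The extra parentheses around the anonymous function, followed by `()`, are needed because the right-hand side of the native pipe must be a function call.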
We will be using the magrittr pipe (%>%) for the rest
of this workshop, as it’s currently the pipe operator you will come
across most in the R world.
Using the vector of temperature provided and using magrittr pipes:
hint: don’t forget to use the place-holder with paste
library(magrittr)
temperature <- c(10, 16, 12, 15, 14, 15, 20)
# your code here
dplyr is a package that is built for data manipulation, using
functions that describe what they do. For example, the
select() function selects columns you want, or don’t want,
from a data frame.
The dplyr package has a lot of functions built in, and each has its own very helpful documentation page with examples: https://dplyr.tidyverse.org/reference/index.html
Dplyr functions work with and without pipes and you’ll see both when
searching online. If using a pipe, you call your data then pipe that to
a function, such as data %>% mean(). If you are not
using a pipe, you call your data within the function, such as
mean(data).
We will focus on two key dplyr functions for now:
select() and filter(). We will use the
messi_career data for the examples. Run the code chunk below to get the
data into R and have a look at it.
# create the messi career data
messi_career <- data.frame(Appearances = c(9,25,36,40,51,53,55,60,50,46,57,49,52,54,50,44),
Goals = c(1,8,17,16,38,47,53,73,60,41,58,41,54,45,51,31),
Season = c(2004,2005,2006,2007,2008,2009,2010,2011,2012,
2013,2014,2015,2016,2017,2018,2019),
Club = rep("FC Barcelona", 16),
Age = seq(17, 32),
champLeagueGoal = c(0,1,1,6,9,8,12,14,8,8,10,6,11,6,12,3))
# view the data
head(messi_career)
## Appearances Goals Season Club Age champLeagueGoal
## 1 9 1 2004 FC Barcelona 17 0
## 2 25 8 2005 FC Barcelona 18 1
## 3 36 17 2006 FC Barcelona 19 1
## 4 40 16 2007 FC Barcelona 20 6
## 5 51 38 2008 FC Barcelona 21 9
## 6 53 47 2009 FC Barcelona 22 8
The select function subsets columns from a data frame using their name. There are several different ways of using select. Run each of the code chunks below and review the outputs.
First, we can give the names of the columns we want to select.
# load dplyr
library(dplyr)
# select single column
messi_career %>% select(Goals)
## Goals
## 1 1
## 2 8
## 3 17
## 4 16
## 5 38
## 6 47
## 7 53
## 8 73
## 9 60
## 10 41
## 11 58
## 12 41
## 13 54
## 14 45
## 15 51
## 16 31
# select all but single column
messi_career %>% select(-Goals)
## Appearances Season Club Age champLeagueGoal
## 1 9 2004 FC Barcelona 17 0
## 2 25 2005 FC Barcelona 18 1
## 3 36 2006 FC Barcelona 19 1
## 4 40 2007 FC Barcelona 20 6
## 5 51 2008 FC Barcelona 21 9
## 6 53 2009 FC Barcelona 22 8
## 7 55 2010 FC Barcelona 23 12
## 8 60 2011 FC Barcelona 24 14
## 9 50 2012 FC Barcelona 25 8
## 10 46 2013 FC Barcelona 26 8
## 11 57 2014 FC Barcelona 27 10
## 12 49 2015 FC Barcelona 28 6
## 13 52 2016 FC Barcelona 29 11
## 14 54 2017 FC Barcelona 30 6
## 15 50 2018 FC Barcelona 31 12
## 16 44 2019 FC Barcelona 32 3
# select multiple columns
messi_career %>% select(Appearances, Goals, Age)
## Appearances Goals Age
## 1 9 1 17
## 2 25 8 18
## 3 36 17 19
## 4 40 16 20
## 5 51 38 21
## 6 53 47 22
## 7 55 53 23
## 8 60 73 24
## 9 50 60 25
## 10 46 41 26
## 11 57 58 27
## 12 49 41 28
## 13 52 54 29
## 14 54 45 30
## 15 50 51 31
## 16 44 31 32
Another method is using a range of columns, known as a slice. Here we are selecting columns from Season to Age, which includes the Club column as well. We can also combine this with the ! (not) operator to exclude those columns.
# select slice (or range) of columns
messi_career %>% select(Season:Age)
## Season Club Age
## 1 2004 FC Barcelona 17
## 2 2005 FC Barcelona 18
## 3 2006 FC Barcelona 19
## 4 2007 FC Barcelona 20
## 5 2008 FC Barcelona 21
## 6 2009 FC Barcelona 22
## 7 2010 FC Barcelona 23
## 8 2011 FC Barcelona 24
## 9 2012 FC Barcelona 25
## 10 2013 FC Barcelona 26
## 11 2014 FC Barcelona 27
## 12 2015 FC Barcelona 28
## 13 2016 FC Barcelona 29
## 14 2017 FC Barcelona 30
## 15 2018 FC Barcelona 31
## 16 2019 FC Barcelona 32
# select slice and other columns
messi_career %>% select(Appearances:Season, champLeagueGoal)
## Appearances Goals Season champLeagueGoal
## 1 9 1 2004 0
## 2 25 8 2005 1
## 3 36 17 2006 1
## 4 40 16 2007 6
## 5 51 38 2008 9
## 6 53 47 2009 8
## 7 55 53 2010 12
## 8 60 73 2011 14
## 9 50 60 2012 8
## 10 46 41 2013 8
## 11 57 58 2014 10
## 12 49 41 2015 6
## 13 52 54 2016 11
## 14 54 45 2017 6
## 15 50 51 2018 12
## 16 44 31 2019 3
# negate selection of columns
messi_career %>% select(!(Season:Age))
## Appearances Goals champLeagueGoal
## 1 9 1 0
## 2 25 8 1
## 3 36 17 1
## 4 40 16 6
## 5 51 38 9
## 6 53 47 8
## 7 55 53 12
## 8 60 73 14
## 9 50 60 8
## 10 46 41 8
## 11 57 58 10
## 12 49 41 6
## 13 52 54 11
## 14 54 45 6
## 15 50 51 12
## 16 44 31 3
# negate selection with slice and extra column (note c() function used)
messi_career %>% select(!c(Season:Age, champLeagueGoal))
## Appearances Goals
## 1 9 1
## 2 25 8
## 3 36 17
## 4 40 16
## 5 51 38
## 6 53 47
## 7 55 53
## 8 60 73
## 9 50 60
## 10 46 41
## 11 57 58
## 12 49 41
## 13 52 54
## 14 54 45
## 15 50 51
## 16 44 31
As you can see, select() makes it easy to extract
columns from your data, and becomes more useful the larger your dataset
becomes.
In the examples above we did not assign the result. See the examples below on how to do this.
# assign result to subset
messi_sub <- messi_career %>%
select(Appearances, Goals, Age)
messi_sub
## Appearances Goals Age
## 1 9 1 17
## 2 25 8 18
## 3 36 17 19
## 4 40 16 20
## 5 51 38 21
## 6 53 47 22
## 7 55 53 23
## 8 60 73 24
## 9 50 60 25
## 10 46 41 26
## 11 57 58 27
## 12 49 41 28
## 13 52 54 29
## 14 54 45 30
## 15 50 51 31
## 16 44 31 32
# The no pipe method
messi_sub <- select(messi_career, Appearances, Goals, Age)
For your exercises, you will be using imdb movie data! I’ve loaded it here in the code for you.
The data has 21 columns, some of which we won’t need. We can use
select to subset our data to keep only what we want.
- Use glimpse() to review the data
- Use select to subset the movies_imdb data so you have the following columns: imdb_title_id through to writer, actors, avg_vote to votes, reviews_from_users to reviews_from_critics. Assign the result to imdb_sub

hint: you should be able to fit this into one select call
# load libraries
library(readr)
library(dplyr)
# load data
movies_imdb <- read_csv("https://raw.githubusercontent.com/andrewmoles2/rTrainIntroduction/main/r-data-wrangling-1/data/IMDb%20movies.csv")
# use glimpse to review data (tidyverse version of str())
movies_imdb %>% glimpse()
## Rows: 85,855
## Columns: 21
## $ imdb_title_id <chr> "tt0000009", "tt0000574", "tt0001892", "tt000210…
## $ title <chr> "Miss Jerry", "The Story of the Kelly Gang", "De…
## $ year <dbl> 1894, 1906, 1911, 1912, 1911, 1912, 1919, 1913, …
## $ date_published <chr> "1894-10-09", "26/12/1906", "19/08/1911", "13/11…
## $ genre <chr> "Romance", "Biography, Crime, Drama", "Drama", "…
## $ duration <dbl> 45, 70, 53, 100, 68, 60, 85, 120, 120, 55, 121, …
## $ country <chr> "USA", "Australia", "Germany, Denmark", "USA", "…
## $ language <chr> "None", "None", NA, "English", "Italian", "Engli…
## $ director <chr> "Alexander Black", "Charles Tait", "Urban Gad", …
## $ writer <chr> "Alexander Black", "Charles Tait", "Urban Gad, G…
## $ production_company <chr> "Alexander Black Photoplays", "J. and N. Tait", …
## $ actors <chr> "Blanche Bayliss, William Courtenay, Chauncey De…
## $ description <chr> "The adventures of a female reporter in the 1890…
## $ avg_vote <dbl> 5.9, 6.1, 5.8, 5.2, 7.0, 5.7, 6.8, 6.2, 6.7, 5.5…
## $ votes <dbl> 154, 589, 188, 446, 2237, 484, 753, 273, 198, 22…
## $ budget <chr> NA, "$ 2250", NA, "$ 45000", NA, NA, NA, "ITL 45…
## $ usa_gross_income <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ worlwide_gross_income <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ metascore <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ reviews_from_users <dbl> 1, 7, 5, 25, 31, 13, 12, 7, 4, 8, 9, 9, 16, 8, N…
## $ reviews_from_critics <dbl> 2, 7, 2, 3, 14, 5, 9, 5, 1, 1, 9, 28, 7, 23, 4, …
# your code here
So far we have selected just columns we named, but there are other
methods we can use. dplyr has a number of helper functions that
come with select().
One such example is the contains() function, which finds
columns that contain a given string. This is a useful option if you
just want to pick out columns that have some similar text in them.
# select by literal string
messi_career %>% select(contains("Goal"))
## Goals champLeagueGoal
## 1 1 0
## 2 8 1
## 3 17 1
## 4 16 6
## 5 38 9
## 6 47 8
## 7 53 12
## 8 73 14
## 9 60 8
## 10 41 8
## 11 58 10
## 12 41 6
## 13 54 11
## 14 45 6
## 15 51 12
## 16 31 3
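One detail worth knowing is that contains() ignores case by default, through its ignore.case argument. The sketch below uses a small made-up data frame (goals_df) with the same column names as the messi data to show the difference.

```r
library(dplyr)

# small illustrative data frame, not the full messi_career data
goals_df <- data.frame(
  Goals = c(1, 8),
  Season = c(2004, 2005),
  champLeagueGoal = c(0, 1)
)

# default: case is ignored, so lower-case "goal" still matches both columns
loose_match <- goals_df %>% select(contains("goal"))
names(loose_match)

# case-sensitive: no column contains a lower-case "goal", so nothing is selected
strict_match <- goals_df %>% select(contains("goal", ignore.case = FALSE))
names(strict_match)
```

The same ignore.case argument is available in starts_with() and ends_with().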
Other options are the starts_with() or
ends_with() helpers. You provide a string of what your
column either starts with or ends with, and they will be selected.
# columns starting with A
messi_career %>%
select(starts_with("A"))
## Appearances Age
## 1 9 17
## 2 25 18
## 3 36 19
## 4 40 20
## 5 51 21
## 6 53 22
## 7 55 23
## 8 60 24
## 9 50 25
## 10 46 26
## 11 57 27
## 12 49 28
## 13 52 29
## 14 54 30
## 15 50 31
## 16 44 32
# columns ending with s
messi_career %>%
select(ends_with("s"))
## Appearances Goals
## 1 9 1
## 2 25 8
## 3 36 17
## 4 40 16
## 5 51 38
## 6 53 47
## 7 55 53
## 8 60 73
## 9 50 60
## 10 46 41
## 11 57 58
## 12 49 41
## 13 52 54
## 14 54 45
## 15 50 51
## 16 44 31
# columns not starting with A
messi_career %>%
select(!starts_with("A"))
## Goals Season Club champLeagueGoal
## 1 1 2004 FC Barcelona 0
## 2 8 2005 FC Barcelona 1
## 3 17 2006 FC Barcelona 1
## 4 16 2007 FC Barcelona 6
## 5 38 2008 FC Barcelona 9
## 6 47 2009 FC Barcelona 8
## 7 53 2010 FC Barcelona 12
## 8 73 2011 FC Barcelona 14
## 9 60 2012 FC Barcelona 8
## 10 41 2013 FC Barcelona 8
## 11 58 2014 FC Barcelona 10
## 12 41 2015 FC Barcelona 6
## 13 54 2016 FC Barcelona 11
## 14 45 2017 FC Barcelona 6
## 15 51 2018 FC Barcelona 12
## 16 31 2019 FC Barcelona 3
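Helpers can also be combined inside select() with the | (or) and & (and) operators. The following is a minimal sketch on a made-up data frame (mini) that borrows column names from the messi data.

```r
library(dplyr)

# small illustrative data frame
mini <- data.frame(
  Appearances = c(9, 25),
  Goals = c(1, 8),
  Season = c(2004, 2005),
  Age = c(17, 18)
)

# columns that start with A OR end with s
either <- mini %>% select(starts_with("A") | ends_with("s"))
names(either)

# columns that start with A AND end with s
both <- mini %>% select(starts_with("A") & ends_with("s"))
names(both)
```

Here the | version keeps Appearances, Age, and Goals, while the & version keeps only Appearances.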
Using the imdb_sub dataset you made in the previous exercise:
|) statement with
select
# your code here
It is also helpful to change the order of your columns, and you can
use select to do this.
If we wanted to make the Club column the first column in our messi_career data, we could do it manually by naming all the columns, like the example below.
# manually
messi_career %>%
select(Club, Appearances, Goals, Season, Age, champLeagueGoal)
## Club Appearances Goals Season Age champLeagueGoal
## 1 FC Barcelona 9 1 2004 17 0
## 2 FC Barcelona 25 8 2005 18 1
## 3 FC Barcelona 36 17 2006 19 1
## 4 FC Barcelona 40 16 2007 20 6
## 5 FC Barcelona 51 38 2008 21 9
## 6 FC Barcelona 53 47 2009 22 8
## 7 FC Barcelona 55 53 2010 23 12
## 8 FC Barcelona 60 73 2011 24 14
## 9 FC Barcelona 50 60 2012 25 8
## 10 FC Barcelona 46 41 2013 26 8
## 11 FC Barcelona 57 58 2014 27 10
## 12 FC Barcelona 49 41 2015 28 6
## 13 FC Barcelona 52 54 2016 29 11
## 14 FC Barcelona 54 45 2017 30 6
## 15 FC Barcelona 50 51 2018 31 12
## 16 FC Barcelona 44 31 2019 32 3
This could get really messy if you have lots of data. Two helper
functions make this much easier: everything() and
last_col(). everything() selects every column not already
specified, so it is useful if we want to move a column to the first column
in the dataset.
# move club to first column
messi_career %>%
select(Club, everything())
## Club Appearances Goals Season Age champLeagueGoal
## 1 FC Barcelona 9 1 2004 17 0
## 2 FC Barcelona 25 8 2005 18 1
## 3 FC Barcelona 36 17 2006 19 1
## 4 FC Barcelona 40 16 2007 20 6
## 5 FC Barcelona 51 38 2008 21 9
## 6 FC Barcelona 53 47 2009 22 8
## 7 FC Barcelona 55 53 2010 23 12
## 8 FC Barcelona 60 73 2011 24 14
## 9 FC Barcelona 50 60 2012 25 8
## 10 FC Barcelona 46 41 2013 26 8
## 11 FC Barcelona 57 58 2014 27 10
## 12 FC Barcelona 49 41 2015 28 6
## 13 FC Barcelona 52 54 2016 29 11
## 14 FC Barcelona 54 45 2017 30 6
## 15 FC Barcelona 50 51 2018 31 12
## 16 FC Barcelona 44 31 2019 32 3
last_col() refers to the last column in your data frame, so we can call
last_col() to move ‘champLeagueGoal’ to the first column,
then use everything() to keep the rest of the columns as they are.
# move last column to first column
messi_career %>%
select(last_col(), everything())
## champLeagueGoal Appearances Goals Season Club Age
## 1 0 9 1 2004 FC Barcelona 17
## 2 1 25 8 2005 FC Barcelona 18
## 3 1 36 17 2006 FC Barcelona 19
## 4 6 40 16 2007 FC Barcelona 20
## 5 9 51 38 2008 FC Barcelona 21
## 6 8 53 47 2009 FC Barcelona 22
## 7 12 55 53 2010 FC Barcelona 23
## 8 14 60 73 2011 FC Barcelona 24
## 9 8 50 60 2012 FC Barcelona 25
## 10 8 46 41 2013 FC Barcelona 26
## 11 10 57 58 2014 FC Barcelona 27
## 12 6 49 41 2015 FC Barcelona 28
## 13 11 52 54 2016 FC Barcelona 29
## 14 6 54 45 2017 FC Barcelona 30
## 15 12 50 51 2018 FC Barcelona 31
## 16 3 44 31 2019 FC Barcelona 32
Another option is to use the relocate() function. This
has the same syntax as select, but has extra functionality for moving
columns with the .after and .before
arguments.
By default, relocate will move the column you specify to the first column.
# default moves to first column
messi_career %>%
relocate(Club)
## Club Appearances Goals Season Age champLeagueGoal
## 1 FC Barcelona 9 1 2004 17 0
## 2 FC Barcelona 25 8 2005 18 1
## 3 FC Barcelona 36 17 2006 19 1
## 4 FC Barcelona 40 16 2007 20 6
## 5 FC Barcelona 51 38 2008 21 9
## 6 FC Barcelona 53 47 2009 22 8
## 7 FC Barcelona 55 53 2010 23 12
## 8 FC Barcelona 60 73 2011 24 14
## 9 FC Barcelona 50 60 2012 25 8
## 10 FC Barcelona 46 41 2013 26 8
## 11 FC Barcelona 57 58 2014 27 10
## 12 FC Barcelona 49 41 2015 28 6
## 13 FC Barcelona 52 54 2016 29 11
## 14 FC Barcelona 54 45 2017 30 6
## 15 FC Barcelona 50 51 2018 31 12
## 16 FC Barcelona 44 31 2019 32 3
We call .after and .before like the
examples below. We can also move more than one column.
# move club to col after champLeagueGoal
messi_career %>%
relocate(Club, .after = champLeagueGoal)
## Appearances Goals Season Age champLeagueGoal Club
## 1 9 1 2004 17 0 FC Barcelona
## 2 25 8 2005 18 1 FC Barcelona
## 3 36 17 2006 19 1 FC Barcelona
## 4 40 16 2007 20 6 FC Barcelona
## 5 51 38 2008 21 9 FC Barcelona
## 6 53 47 2009 22 8 FC Barcelona
## 7 55 53 2010 23 12 FC Barcelona
## 8 60 73 2011 24 14 FC Barcelona
## 9 50 60 2012 25 8 FC Barcelona
## 10 46 41 2013 26 8 FC Barcelona
## 11 57 58 2014 27 10 FC Barcelona
## 12 49 41 2015 28 6 FC Barcelona
## 13 52 54 2016 29 11 FC Barcelona
## 14 54 45 2017 30 6 FC Barcelona
## 15 50 51 2018 31 12 FC Barcelona
## 16 44 31 2019 32 3 FC Barcelona
# move Club and Goals to before champLeagueGoal
messi_career %>%
relocate(Club, Goals, .before = champLeagueGoal)
## Appearances Season Age Club Goals champLeagueGoal
## 1 9 2004 17 FC Barcelona 1 0
## 2 25 2005 18 FC Barcelona 8 1
## 3 36 2006 19 FC Barcelona 17 1
## 4 40 2007 20 FC Barcelona 16 6
## 5 51 2008 21 FC Barcelona 38 9
## 6 53 2009 22 FC Barcelona 47 8
## 7 55 2010 23 FC Barcelona 53 12
## 8 60 2011 24 FC Barcelona 73 14
## 9 50 2012 25 FC Barcelona 60 8
## 10 46 2013 26 FC Barcelona 41 8
## 11 57 2014 27 FC Barcelona 58 10
## 12 49 2015 28 FC Barcelona 41 6
## 13 52 2016 29 FC Barcelona 54 11
## 14 54 2017 30 FC Barcelona 45 6
## 15 50 2018 31 FC Barcelona 51 12
## 16 44 2019 32 FC Barcelona 31 3
Using the examples above:
- Move the year column to be the first column in the imdb_sub data frame
- Move the avg_vote column to be after the year column

# your code here
The filter function allows you to subset rows based on conditions,
using conditional operators (==, <=, != etc.). It is similar to the
base R subset() function which we have used in previous R
workshops. The table below is a reminder of the conditional operators
you can use.
| Operator | Meaning |
|---|---|
| > | Greater than |
| >= | Greater than or equal to |
| < | Less than |
| <= | Less than or equal to |
| == | Equal to |
| != | Not equal to |
| !X | NOT X |
| X \| Y | X OR Y |
| X & Y | X AND Y |
| X %in% Y | is X in Y |
Just like when using select, you provide the column name
you want to apply conditional logic to. If you are piping, you don’t
need to provide your data as an argument in the function.
Run the examples below and review the outputs.
# filter based on one criterion
messi_career %>% filter(Goals > 50)
## Appearances Goals Season Club Age champLeagueGoal
## 1 55 53 2010 FC Barcelona 23 12
## 2 60 73 2011 FC Barcelona 24 14
## 3 50 60 2012 FC Barcelona 25 8
## 4 57 58 2014 FC Barcelona 27 10
## 5 52 54 2016 FC Barcelona 29 11
## 6 50 51 2018 FC Barcelona 31 12
# filter then pipe to select
messi_career %>% filter(Appearances >= 55) %>%
select(Season, Age)
## Season Age
## 1 2010 23
## 2 2011 24
## 3 2014 27
# filter on more than one condition
messi_career %>% filter(Goals > 50 & champLeagueGoal <= 10)
## Appearances Goals Season Club Age champLeagueGoal
## 1 50 60 2012 FC Barcelona 25 8
## 2 57 58 2014 FC Barcelona 27 10
# filter on average
messi_career %>% filter(Goals > mean(Goals, na.rm = TRUE))
## Appearances Goals Season Club Age champLeagueGoal
## 1 53 47 2009 FC Barcelona 22 8
## 2 55 53 2010 FC Barcelona 23 12
## 3 60 73 2011 FC Barcelona 24 14
## 4 50 60 2012 FC Barcelona 25 8
## 5 46 41 2013 FC Barcelona 26 8
## 6 57 58 2014 FC Barcelona 27 10
## 7 49 41 2015 FC Barcelona 28 6
## 8 52 54 2016 FC Barcelona 29 11
## 9 54 45 2017 FC Barcelona 30 6
## 10 50 51 2018 FC Barcelona 31 12
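The examples above use >, >=, <= and &. The | and %in% operators from the table work the same way inside filter(). The sketch below rebuilds three columns of the messi data under the made-up name messi_mini so that it runs on its own.

```r
library(dplyr)

# three columns from the messi data, rebuilt so this chunk is self-contained
messi_mini <- data.frame(
  Appearances = c(9,25,36,40,51,53,55,60,50,46,57,49,52,54,50,44),
  Goals = c(1,8,17,16,38,47,53,73,60,41,58,41,54,45,51,31),
  Season = 2004:2019
)

# %in%: keep rows where Season matches any value in the vector
chosen_seasons <- messi_mini %>% filter(Season %in% c(2010, 2012))
chosen_seasons

# | (or): keep rows where at least one condition is true
extremes <- messi_mini %>% filter(Goals > 70 | Appearances < 10)
extremes
```

The %in% filter returns the 2010 and 2012 seasons, and the | filter returns 2004 (fewer than 10 appearances) and 2011 (more than 70 goals).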
To assign the result to a new data frame (subset) we use the assignment operator at the beginning or the end of our code. Here we assign at the beginning, as we did in the pipes section.
# assign result to messi_sub
messi_sub <- messi_career %>%
filter(Appearances <= 40) %>%
select(Goals, Age)
# view result
messi_sub
## Goals Age
## 1 1 17
## 2 8 18
## 3 17 19
## 4 16 20
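For completeness, assigning at the end of a pipe uses the right-hand assignment operator ->. The sketch below shows both versions giving the same result; messi_mini, start_assign, and end_assign are illustrative names.

```r
library(dplyr)

# three columns from the messi data, rebuilt so this chunk is self-contained
messi_mini <- data.frame(
  Appearances = c(9,25,36,40,51,53,55,60,50,46,57,49,52,54,50,44),
  Goals = c(1,8,17,16,38,47,53,73,60,41,58,41,54,45,51,31),
  Age = 17:32
)

# assignment at the beginning with <-
start_assign <- messi_mini %>%
  filter(Appearances <= 40) %>%
  select(Goals, Age)

# assignment at the end with ->
messi_mini %>%
  filter(Appearances <= 40) %>%
  select(Goals, Age) -> end_assign

# both approaches produce the same data frame
identical(start_assign, end_assign)
```

Assigning at the beginning is the more common style, as it makes it easy to see at a glance what object a chunk of code creates.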
We are going to filter our subsetted (imdb_sub) data to
find the best rated films from the USA in the year 1989, and create a
subset called USA_1989_high.
# your code here
You might have noticed that the country column has some strings that
are split by a comma, e.g. “Germany, Denmark”. The == operator will not
be able to pick these up. Instead we would use the base R
grepl() function or str_detect() from the
stringr package. This won’t be covered in this workshop,
but will be in future workshops. If you are interested, have a look at
the stringr package - https://stringr.tidyverse.org/index.html.
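As a brief taste of the idea, here is a minimal sketch using base R grepl() inside filter(). The films data frame is made up for illustration and is not the imdb data.

```r
library(dplyr)

# hypothetical data with single and comma-separated countries
films <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  country = c("USA", "Germany, Denmark", "UK, USA")
)

# == only matches strings that are exactly "USA"
exact_match <- films %>% filter(country == "USA")
exact_match

# grepl() matches any string that contains "USA"
partial_match <- films %>% filter(grepl("USA", country))
partial_match
```

The exact match returns one film, while the grepl() version also picks up the co-production listed as "UK, USA".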
Other than conditional subsetting of data using
filter(), dplyr has other functions we can use to subset
our data: slice, sample, and
distinct.
The sample functions randomly extract a set number of rows from your
data. This is helpful if you want to take a random sample of your
dataset. The examples below show the sample_n() and
sample_frac() functions.
# sample 5 rows
messi_career %>%
sample_n(5)
## Appearances Goals Season Club Age champLeagueGoal
## 1 53 47 2009 FC Barcelona 22 8
## 2 60 73 2011 FC Barcelona 24 14
## 3 52 54 2016 FC Barcelona 29 11
## 4 51 38 2008 FC Barcelona 21 9
## 5 50 60 2012 FC Barcelona 25 8
# sample 25% of your data
messi_career %>%
sample_frac(0.25)
## Appearances Goals Season Club Age champLeagueGoal
## 1 52 54 2016 FC Barcelona 29 11
## 2 36 17 2006 FC Barcelona 19 1
## 3 9 1 2004 FC Barcelona 17 0
## 4 55 53 2010 FC Barcelona 23 12
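Because sampling is random, you will get different rows each time you run the code above. If you need the same sample every time, you can call set.seed() first. The sketch below also uses slice_sample(), which is the newer dplyr (version 1.0.0 onwards) replacement for sample_n() and sample_frac(); the seasons data frame and the seed value 123 are arbitrary choices for illustration.

```r
library(dplyr)

# two columns from the messi data, rebuilt so this chunk is self-contained
seasons <- data.frame(
  Season = 2004:2019,
  Goals = c(1,8,17,16,38,47,53,73,60,41,58,41,54,45,51,31)
)

# setting the same seed before each draw gives the same random rows
set.seed(123)
first_draw <- seasons %>% slice_sample(n = 5)
set.seed(123)
second_draw <- seasons %>% slice_sample(n = 5)
identical(first_draw, second_draw)

# prop is the slice_sample() equivalent of sample_frac()
quarter <- seasons %>% slice_sample(prop = 0.25)
nrow(quarter)
```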
The slice functions are more useful. The basic slice
function is the equivalent of using numbered indexing in base R, e.g.
data[1:5, ], but is designed to work better in the
tidyverse environment.
# select rows 4, 5, and 6
messi_career %>%
slice(4:6)
## Appearances Goals Season Club Age champLeagueGoal
## 1 40 16 2007 FC Barcelona 20 6
## 2 51 38 2008 FC Barcelona 21 9
## 3 53 47 2009 FC Barcelona 22 8
# equivalent in base R
messi_career[4:6, ]
## Appearances Goals Season Club Age champLeagueGoal
## 4 40 16 2007 FC Barcelona 20 6
## 5 51 38 2008 FC Barcelona 21 9
## 6 53 47 2009 FC Barcelona 22 8
The slice_max and slice_min functions are
much more powerful, and are harder and messier to achieve with normal
base R code. They allow you to extract the rows that have the max (or min)
values in a specified column. In the example, we extract the rows that have the
top three and bottom three values in the Goals column.
# extract rows with top three Goals
messi_career %>%
slice_max(Goals, n = 3)
## Appearances Goals Season Club Age champLeagueGoal
## 1 60 73 2011 FC Barcelona 24 14
## 2 50 60 2012 FC Barcelona 25 8
## 3 57 58 2014 FC Barcelona 27 10
# this is harder and less clear in base R
messi_career[messi_career$Goals %in% tail(sort(messi_career$Goals), 3), ]
## Appearances Goals Season Club Age champLeagueGoal
## 8 60 73 2011 FC Barcelona 24 14
## 9 50 60 2012 FC Barcelona 25 8
## 11 57 58 2014 FC Barcelona 27 10
# extract rows with bottom three Goals
messi_career %>%
slice_min(Goals, n = 3)
## Appearances Goals Season Club Age champLeagueGoal
## 1 9 1 2004 FC Barcelona 17 0
## 2 25 8 2005 FC Barcelona 18 1
## 3 40 16 2007 FC Barcelona 20 6
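One thing to watch for is ties. By default slice_max() and slice_min() keep tied rows (with_ties = TRUE), so you can get more rows back than the n you asked for. The sketch below uses the champLeagueGoal column, where the value 12 appears twice; messi_mini is an illustrative name.

```r
library(dplyr)

# two columns from the messi data, rebuilt so this chunk is self-contained
messi_mini <- data.frame(
  Season = 2004:2019,
  champLeagueGoal = c(0,1,1,6,9,8,12,14,8,8,10,6,11,6,12,3)
)

# default: ties are kept, so the two seasons with 12 goals both appear
top_with_ties <- messi_mini %>% slice_max(champLeagueGoal, n = 2)
top_with_ties

# with_ties = FALSE returns exactly n rows
top_no_ties <- messi_mini %>% slice_max(champLeagueGoal, n = 2, with_ties = FALSE)
top_no_ties
```

The first call returns three rows (14, 12, 12) and the second returns two.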
In this exercise you will need to debug my code to get it working. We will filter the imdb_sub data for films over 120 minutes, and in the USA, then extract the top 20 rated films.
If you get it working, your top_votes_USA data frame
should have 20 rows and 4 columns (title, year, genre and avg_vote) with
films such as The Shawshank Redemption and The
Godfather. As a bonus, if you get your code working, the plot at
the end of the code will run!
# your code here
top_votes_USA <- imdb_sub %>%
filter(duration >= 120 & country = "USA") |>
slicemax(avgvote, n = 20) %>%
select(title year, genre, avg_vote)
top_votes_USA
# fun extra, plot the output of your debugging!
plot(top_votes_USA$year, top_votes_USA$avg_vote,
col = "orange", # point colour
pch = 16, # point type
cex = 1.5, # point size
xlab = "Year",
ylab = "Average vote")
For this coding challenge we are going to extract all Tolkien (Lord of the Rings and Hobbit) and Harry Potter films from our imdb dataset. We have provided vectors with the titles of these films.
- Use the %in% operator to filter titles in the imdb dataset that match the Tolkien or Potter vectors.

hint: for 4 and 5 you can use filter to compare the column to the mean of that column, e.g. filter(data, column > mean(column))
Tolkien <- c("The Lord of the Rings: The Fellowship of the Ring", "The Lord of the Rings: The Return of the King",
"The Lord of the Rings: The Two Towers", "The Hobbit: An Unexpected Journey",
"The Hobbit: The Desolation of Smaug", "The Hobbit: The Battle of the Five Armies")
Potter <- c("Harry Potter and the Sorcerer's Stone", "Harry Potter and the Chamber of Secrets",
"Harry Potter and the Prisoner of Azkaban", "Harry Potter and the Goblet of Fire",
"Harry Potter and the Order of the Phoenix", "Harry Potter and the Half-Blood Prince",
"Harry Potter and the Deathly Hallows: Part 1", "Harry Potter and the Deathly Hallows: Part 2")
# your code here
To manipulate and create new columns using the mutate function from dplyr, as well as cleaning column names.
In this workshop, the aim is to cover how to perform data wrangling tasks on columns using dplyr. We will be covering:
The mutate function is from the dplyr library, and is for making, modifying, or deleting columns in your dataset. Similar to what we have done in previous sessions, mutate allows you to make a new column from a calculation you have made.
The main difference between using mutate and making new columns in base R is that mutate is smarter. You can create a new column based on a new column you have just made within the same mutate call, which you can’t do in a single base R call such as transform(). Let’s look at some examples with the messi data we used in the last session.
In our previous workshops, we calculated Messi’s goals per game
(goals/appearances). We can do this with mutate. Notice the syntax, we
give the name we want to call our new column first, then =, then what we
want to do (e.g. a calculation);
mutate(new_column = x/y).
note: dplyr re-exports the magrittr pipe (%>%), so loading dplyr is enough for piping
# load dplyr
library(dplyr)
# create the messi career data
messi_career <- data.frame(Appearances = c(9,25,36,40,51,53,55,60,50,46,57,49,52,54,50,44),
Goals = c(1,8,17,16,38,47,53,73,60,41,58,41,54,45,51,31),
Season = c(2004,2005,2006,2007,2008,2009,2010,2011,2012,
2013,2014,2015,2016,2017,2018,2019),
Club = rep("FC Barcelona", 16),
Age = seq(17, 32),
champLeagueGoal = c(0,1,1,6,9,8,12,14,8,8,10,6,11,6,12,3))
# view the data
head(messi_career)
## Appearances Goals Season Club Age champLeagueGoal
## 1 9 1 2004 FC Barcelona 17 0
## 2 25 8 2005 FC Barcelona 18 1
## 3 36 17 2006 FC Barcelona 19 1
## 4 40 16 2007 FC Barcelona 20 6
## 5 51 38 2008 FC Barcelona 21 9
## 6 53 47 2009 FC Barcelona 22 8
# calculate the goal to appearance ratio
messi_career %>%
mutate(goal_ratio = Goals/Appearances)
## Appearances Goals Season Club Age champLeagueGoal goal_ratio
## 1 9 1 2004 FC Barcelona 17 0 0.1111111
## 2 25 8 2005 FC Barcelona 18 1 0.3200000
## 3 36 17 2006 FC Barcelona 19 1 0.4722222
## 4 40 16 2007 FC Barcelona 20 6 0.4000000
## 5 51 38 2008 FC Barcelona 21 9 0.7450980
## 6 53 47 2009 FC Barcelona 22 8 0.8867925
## 7 55 53 2010 FC Barcelona 23 12 0.9636364
## 8 60 73 2011 FC Barcelona 24 14 1.2166667
## 9 50 60 2012 FC Barcelona 25 8 1.2000000
## 10 46 41 2013 FC Barcelona 26 8 0.8913043
## 11 57 58 2014 FC Barcelona 27 10 1.0175439
## 12 49 41 2015 FC Barcelona 28 6 0.8367347
## 13 52 54 2016 FC Barcelona 29 11 1.0384615
## 14 54 45 2017 FC Barcelona 30 6 0.8333333
## 15 50 51 2018 FC Barcelona 31 12 1.0200000
## 16 44 31 2019 FC Barcelona 32 3 0.7045455
The new column, goal_ratio in this case, will automatically be added to the end of your data frame. This is the same behaviour you will see when using base R. This behaviour can be altered if you want, but we won’t have time to cover it here.
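For reference only, one way to change where the new column goes is the .after (or .before) argument of mutate(), available from dplyr 1.0.0. This is a minimal sketch on a made-up three-row data frame (messi_mini), not something the exercises rely on.

```r
library(dplyr)

# first three seasons only, for illustration
messi_mini <- data.frame(
  Appearances = c(9, 25, 36),
  Goals = c(1, 8, 17),
  Season = c(2004, 2005, 2006)
)

# place the new column directly after Goals instead of at the end
placed <- messi_mini %>%
  mutate(goal_ratio = Goals / Appearances, .after = Goals)

names(placed)
```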
What makes mutate() powerful is the ability to do
multiple calculations in one statement, as well as using newly made
columns. The example below should help to illustrate this. We
will use goal_ratio to find the difference between each season’s goal_ratio and
the average goal ratio, calculated here as mean goals divided by mean appearances.
# calculate goal ratio and diff from mean
messi_career <- messi_career %>%
mutate(
goal_ratio = round(Goals/Appearances, digits = 2),
diff_avg_goal_ratio = goal_ratio - (mean(Goals) / mean(Appearances)))
# print result
messi_career
## Appearances Goals Season Club Age champLeagueGoal goal_ratio
## 1 9 1 2004 FC Barcelona 17 0 0.11
## 2 25 8 2005 FC Barcelona 18 1 0.32
## 3 36 17 2006 FC Barcelona 19 1 0.47
## 4 40 16 2007 FC Barcelona 20 6 0.40
## 5 51 38 2008 FC Barcelona 21 9 0.75
## 6 53 47 2009 FC Barcelona 22 8 0.89
## 7 55 53 2010 FC Barcelona 23 12 0.96
## 8 60 73 2011 FC Barcelona 24 14 1.22
## 9 50 60 2012 FC Barcelona 25 8 1.20
## 10 46 41 2013 FC Barcelona 26 8 0.89
## 11 57 58 2014 FC Barcelona 27 10 1.02
## 12 49 41 2015 FC Barcelona 28 6 0.84
## 13 52 54 2016 FC Barcelona 29 11 1.04
## 14 54 45 2017 FC Barcelona 30 6 0.83
## 15 50 51 2018 FC Barcelona 31 12 1.02
## 16 44 31 2019 FC Barcelona 32 3 0.70
## diff_avg_goal_ratio
## 1 -0.75730506
## 2 -0.54730506
## 3 -0.39730506
## 4 -0.46730506
## 5 -0.11730506
## 6 0.02269494
## 7 0.09269494
## 8 0.35269494
## 9 0.33269494
## 10 0.02269494
## 11 0.15269494
## 12 -0.02730506
## 13 0.17269494
## 14 -0.03730506
## 15 0.15269494
## 16 -0.16730506
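One detail worth checking in the code above: mean(Goals) / mean(Appearances) is the ratio of the two means, which is not the same number as the mean of the per-season ratios. The sketch below uses only the first three seasons (an assumption made to keep the arithmetic small) to show that the two differ.

```r
# first three seasons only, to keep the arithmetic easy to follow
mini_career <- data.frame(
  Appearances = c(9, 25, 36),
  Goals = c(1, 8, 17)
)

# what the workshop code uses: mean goals divided by mean appearances
ratio_of_means <- mean(mini_career$Goals) / mean(mini_career$Appearances)

# the alternative: average of the individual season ratios
mean_of_ratios <- mean(mini_career$Goals / mini_career$Appearances)

ratio_of_means
mean_of_ratios
```

The ratio of means weights busy seasons more heavily, while the mean of ratios treats every season equally, so it is worth deciding which of the two you actually want before comparing against it.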
We can then pipe this result to filter(), which allows
us to see which seasons Messi has a goal ratio above his average goal
ratio.
messi_career %>%
mutate(
goal_ratio = round(Goals/Appearances, digits = 2),
diff_avg_goal_ratio = goal_ratio - (mean(Goals) / mean(Appearances))) %>%
filter(diff_avg_goal_ratio > 0)
## Appearances Goals Season Club Age champLeagueGoal goal_ratio
## 1 53 47 2009 FC Barcelona 22 8 0.89
## 2 55 53 2010 FC Barcelona 23 12 0.96
## 3 60 73 2011 FC Barcelona 24 14 1.22
## 4 50 60 2012 FC Barcelona 25 8 1.20
## 5 46 41 2013 FC Barcelona 26 8 0.89
## 6 57 58 2014 FC Barcelona 27 10 1.02
## 7 52 54 2016 FC Barcelona 29 11 1.04
## 8 50 51 2018 FC Barcelona 31 12 1.02
## diff_avg_goal_ratio
## 1 0.02269494
## 2 0.09269494
## 3 0.35269494
## 4 0.33269494
## 5 0.02269494
## 6 0.15269494
## 7 0.17269494
## 8 0.15269494
We will be using the imdb movies dataset again for this workshop. Use the code below to load in the data.
# load libraries
library(readr)
library(dplyr)
# load data
movies_imdb <- read_csv("https://raw.githubusercontent.com/andrewmoles2/rTrainIntroduction/main/r-data-wrangling-1/data/IMDb%20movies.csv")
# use glimpse to review data (tidyverse version of str())
movies_imdb %>% glimpse()
## Rows: 85,855
## Columns: 21
## $ imdb_title_id <chr> "tt0000009", "tt0000574", "tt0001892", "tt000210…
## $ title <chr> "Miss Jerry", "The Story of the Kelly Gang", "De…
## $ year <dbl> 1894, 1906, 1911, 1912, 1911, 1912, 1919, 1913, …
## $ date_published <chr> "1894-10-09", "26/12/1906", "19/08/1911", "13/11…
## $ genre <chr> "Romance", "Biography, Crime, Drama", "Drama", "…
## $ duration <dbl> 45, 70, 53, 100, 68, 60, 85, 120, 120, 55, 121, …
## $ country <chr> "USA", "Australia", "Germany, Denmark", "USA", "…
## $ language <chr> "None", "None", NA, "English", "Italian", "Engli…
## $ director <chr> "Alexander Black", "Charles Tait", "Urban Gad", …
## $ writer <chr> "Alexander Black", "Charles Tait", "Urban Gad, G…
## $ production_company <chr> "Alexander Black Photoplays", "J. and N. Tait", …
## $ actors <chr> "Blanche Bayliss, William Courtenay, Chauncey De…
## $ description <chr> "The adventures of a female reporter in the 1890…
## $ avg_vote <dbl> 5.9, 6.1, 5.8, 5.2, 7.0, 5.7, 6.8, 6.2, 6.7, 5.5…
## $ votes <dbl> 154, 589, 188, 446, 2237, 484, 753, 273, 198, 22…
## $ budget <chr> NA, "$ 2250", NA, "$ 45000", NA, NA, NA, "ITL 45…
## $ usa_gross_income <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ worlwide_gross_income <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ metascore <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ reviews_from_users <dbl> 1, 7, 5, 25, 31, 13, 12, 7, 4, 8, 9, 9, 16, 8, N…
## $ reviews_from_critics <dbl> 2, 7, 2, 3, 14, 5, 9, 5, 1, 1, 9, 28, 7, 23, 4, …
Let's pretend we are interested in the difference between the number of user reviews and critic reviews for each film in our movies_imdb dataset. We can use mutate to explore this difference a bit further.
1. Pipe movies_imdb to a mutate() function. Make a new column called user_critic_ratio, and divide reviews_from_users by reviews_from_critics. Wrap the result in a round() function, rounding to two digits
2. Pipe to a filter() function, filtering country to be USA and year to be 1989
3. Pipe to a select() function, selecting the title, avg_vote and user_critic_ratio columns
4. Pipe to a slice_max() function, extracting the rows with the top 10 avg_vote
You should get a data frame returned that has films including: The Abyss, Dead Poets Society, Do the Right Thing, and Glory.
# your code here
We can see we get more user reviews than critic reviews, which makes sense; for example, The Abyss has 4 user reviews for each critic review.
In our second mutate exercise, you will need to debug the code to get it running! You may need to re-order some elements of the code as well as check for other errors.
We are filtering the movies_imdb data for films that are from the USA before the year 1990, have a duration less than 120 minutes, and an average vote greater than 8.5. We will also be using the user_critic_ratio column to make it into a string for easier reading.
You should end up with a data frame with 6 rows, and 4 columns (title, year, avg_vote, and ratio_string). The final column, ratio_string, should have an output like “Psycho has a user to critic ratio of 5.44”.
# your code here
usa_pre90_high <- movies_imdb |>
mutate(user_critic_ratio = round(reviews_from_users / reviews_from_critics, digits = 2),
ratio_string = paste(title, "has a user to critic ratio of", userCriticRatio)) %>%
filter(country == "USA" & year < 1990)
select(title, year, avg_vote, ratio_string) %>%
filter(duration < 120 & avg_vote >= 8.5)
usa_pre90_high
We can take the mutate function further by using the
across() function. This allows us to perform operations (do
something) across multiple columns. This is very useful for doing type
conversions in an efficient way.
The across function works in a similar way to the
select() function, but if you want to pick out a few
columns you have to use the c() function. See the examples
below, where we have selected two columns, or used a slice to select out
a few columns that are next to each other.
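# perform round (to 1 decimal place) across selected columns
# (the ~ .x form avoids passing extra arguments through across(), which newer dplyr versions deprecate)
messi_career %>%
mutate(across(c(goal_ratio, diff_avg_goal_ratio), ~ round(.x, digits = 1)))
## Appearances Goals Season Club Age champLeagueGoal goal_ratio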
## 1 9 1 2004 FC Barcelona 17 0 0.1
## 2 25 8 2005 FC Barcelona 18 1 0.3
## 3 36 17 2006 FC Barcelona 19 1 0.5
## 4 40 16 2007 FC Barcelona 20 6 0.4
## 5 51 38 2008 FC Barcelona 21 9 0.8
## 6 53 47 2009 FC Barcelona 22 8 0.9
## 7 55 53 2010 FC Barcelona 23 12 1.0
## 8 60 73 2011 FC Barcelona 24 14 1.2
## 9 50 60 2012 FC Barcelona 25 8 1.2
## 10 46 41 2013 FC Barcelona 26 8 0.9
## 11 57 58 2014 FC Barcelona 27 10 1.0
## 12 49 41 2015 FC Barcelona 28 6 0.8
## 13 52 54 2016 FC Barcelona 29 11 1.0
## 14 54 45 2017 FC Barcelona 30 6 0.8
## 15 50 51 2018 FC Barcelona 31 12 1.0
## 16 44 31 2019 FC Barcelona 32 3 0.7
## diff_avg_goal_ratio
## 1 -0.8
## 2 -0.5
## 3 -0.4
## 4 -0.5
## 5 -0.1
## 6 0.0
## 7 0.1
## 8 0.4
## 9 0.3
## 10 0.0
## 11 0.2
## 12 0.0
## 13 0.2
## 14 0.0
## 15 0.2
## 16 -0.2
# square root across columns selected with slice
messi_career %>%
mutate(across(1:3, sqrt))
## Appearances Goals Season Club Age champLeagueGoal goal_ratio
## 1 3.000000 1.000000 44.76606 FC Barcelona 17 0 0.11
## 2 5.000000 2.828427 44.77723 FC Barcelona 18 1 0.32
## 3 6.000000 4.123106 44.78839 FC Barcelona 19 1 0.47
## 4 6.324555 4.000000 44.79955 FC Barcelona 20 6 0.40
## 5 7.141428 6.164414 44.81071 FC Barcelona 21 9 0.75
## 6 7.280110 6.855655 44.82187 FC Barcelona 22 8 0.89
## 7 7.416198 7.280110 44.83302 FC Barcelona 23 12 0.96
## 8 7.745967 8.544004 44.84417 FC Barcelona 24 14 1.22
## 9 7.071068 7.745967 44.85532 FC Barcelona 25 8 1.20
## 10 6.782330 6.403124 44.86647 FC Barcelona 26 8 0.89
## 11 7.549834 7.615773 44.87761 FC Barcelona 27 10 1.02
## 12 7.000000 6.403124 44.88875 FC Barcelona 28 6 0.84
## 13 7.211103 7.348469 44.89989 FC Barcelona 29 11 1.04
## 14 7.348469 6.708204 44.91102 FC Barcelona 30 6 0.83
## 15 7.071068 7.141428 44.92215 FC Barcelona 31 12 1.02
## 16 6.633250 5.567764 44.93328 FC Barcelona 32 3 0.70
## diff_avg_goal_ratio
## 1 -0.75730506
## 2 -0.54730506
## 3 -0.39730506
## 4 -0.46730506
## 5 -0.11730506
## 6 0.02269494
## 7 0.09269494
## 8 0.35269494
## 9 0.33269494
## 10 0.02269494
## 11 0.15269494
## 12 -0.02730506
## 13 0.17269494
## 14 -0.03730506
## 15 0.15269494
## 16 -0.16730506
# square root across columns selected with slice (using col names)
messi_career %>%
mutate(across(Appearances:Season, sqrt))
## Appearances Goals Season Club Age champLeagueGoal goal_ratio
## 1 3.000000 1.000000 44.76606 FC Barcelona 17 0 0.11
## 2 5.000000 2.828427 44.77723 FC Barcelona 18 1 0.32
## 3 6.000000 4.123106 44.78839 FC Barcelona 19 1 0.47
## 4 6.324555 4.000000 44.79955 FC Barcelona 20 6 0.40
## 5 7.141428 6.164414 44.81071 FC Barcelona 21 9 0.75
## 6 7.280110 6.855655 44.82187 FC Barcelona 22 8 0.89
## 7 7.416198 7.280110 44.83302 FC Barcelona 23 12 0.96
## 8 7.745967 8.544004 44.84417 FC Barcelona 24 14 1.22
## 9 7.071068 7.745967 44.85532 FC Barcelona 25 8 1.20
## 10 6.782330 6.403124 44.86647 FC Barcelona 26 8 0.89
## 11 7.549834 7.615773 44.87761 FC Barcelona 27 10 1.02
## 12 7.000000 6.403124 44.88875 FC Barcelona 28 6 0.84
## 13 7.211103 7.348469 44.89989 FC Barcelona 29 11 1.04
## 14 7.348469 6.708204 44.91102 FC Barcelona 30 6 0.83
## 15 7.071068 7.141428 44.92215 FC Barcelona 31 12 1.02
## 16 6.633250 5.567764 44.93328 FC Barcelona 32 3 0.70
## diff_avg_goal_ratio
## 1 -0.75730506
## 2 -0.54730506
## 3 -0.39730506
## 4 -0.46730506
## 5 -0.11730506
## 6 0.02269494
## 7 0.09269494
## 8 0.35269494
## 9 0.33269494
## 10 0.02269494
## 11 0.15269494
## 12 -0.02730506
## 13 0.17269494
## 14 -0.03730506
## 15 0.15269494
## 16 -0.16730506
We can also combine the across function with the where()
or all_of() functions to perform conditional mutations.
The where() function does conditional matching between
the statement you’ve used and what is in your dataset. In the example we
are asking where() to look for columns that are the
character (string) data type. Then we can perform an operation, such as
convert those columns to factors. In this case it is just the Club
column that changes.
# perform conditional operation with where
messi_career %>%
mutate(across(where(is.character), as.factor)) %>%
glimpse()
## Rows: 16
## Columns: 8
## $ Appearances <dbl> 9, 25, 36, 40, 51, 53, 55, 60, 50, 46, 57, 49, 52,…
## $ Goals <dbl> 1, 8, 17, 16, 38, 47, 53, 73, 60, 41, 58, 41, 54, …
## $ Season <dbl> 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 20…
## $ Club <fct> FC Barcelona, FC Barcelona, FC Barcelona, FC Barce…
## $ Age <int> 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29…
## $ champLeagueGoal <dbl> 0, 1, 1, 6, 9, 8, 12, 14, 8, 8, 10, 6, 11, 6, 12, 3
## $ goal_ratio <dbl> 0.11, 0.32, 0.47, 0.40, 0.75, 0.89, 0.96, 1.22, 1.…
## $ diff_avg_goal_ratio <dbl> -0.75730506, -0.54730506, -0.39730506, -0.46730506…
The all_of() function looks for matches between the
strings you have provided and the column names in your dataset. In our
example, we put the Season and Club columns into a vector, then call
that vector and convert those columns to a factor.
# change selected variables with all_of
to_factor <- c("Season", "Club")
messi_career %>%
mutate(across(all_of(to_factor), as.factor)) %>%
glimpse()
## Rows: 16
## Columns: 8
## $ Appearances <dbl> 9, 25, 36, 40, 51, 53, 55, 60, 50, 46, 57, 49, 52,…
## $ Goals <dbl> 1, 8, 17, 16, 38, 47, 53, 73, 60, 41, 58, 41, 54, …
## $ Season <fct> 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 20…
## $ Club <fct> FC Barcelona, FC Barcelona, FC Barcelona, FC Barce…
## $ Age <int> 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29…
## $ champLeagueGoal <dbl> 0, 1, 1, 6, 9, 8, 12, 14, 8, 8, 10, 6, 11, 6, 12, 3
## $ goal_ratio <dbl> 0.11, 0.32, 0.47, 0.40, 0.75, 0.89, 0.96, 1.22, 1.…
## $ diff_avg_goal_ratio <dbl> -0.75730506, -0.54730506, -0.39730506, -0.46730506…
Let's go back to our movies_imdb data. We want to extract films from 1990 through to 1995, that are from the USA, and have an avg_vote greater than or equal to 7.5. We also want all our variables that are currently characters to be factors, and want the year column to also be a factor.
1. Save the result as usa_early90_high
2. Using the usa_early90_high subset, filter for avg_vote greater than or equal to 8.5, then select the title, avg_vote, and year columns. View the result to see the top rated films and what year they were in.
# your code here
It can sometimes be helpful to rank your dataset. Using mutate() with
the min_rank() or percent_rank() functions allows
you to add a new column with a rank based on an important column. A higher
rank or percent rank means a larger value, because rank 1 goes to the smallest value.
In this example, we want to make a goal ranking column and a percent ranking column. We can then use filter to select rankings we are interested in.
messi_career <- messi_career %>%
mutate(goal_rank = min_rank(Goals),
goal_perc_rank = percent_rank(Goals))
# select rankings over 10
messi_career %>%
filter(goal_rank > 10)
## Appearances Goals Season Club Age champLeagueGoal goal_ratio
## 1 55 53 2010 FC Barcelona 23 12 0.96
## 2 60 73 2011 FC Barcelona 24 14 1.22
## 3 50 60 2012 FC Barcelona 25 8 1.20
## 4 57 58 2014 FC Barcelona 27 10 1.02
## 5 52 54 2016 FC Barcelona 29 11 1.04
## 6 50 51 2018 FC Barcelona 31 12 1.02
## diff_avg_goal_ratio goal_rank goal_perc_rank
## 1 0.09269494 12 0.7333333
## 2 0.35269494 16 1.0000000
## 3 0.33269494 15 0.9333333
## 4 0.15269494 14 0.8666667
## 5 0.17269494 13 0.8000000
## 6 0.15269494 11 0.6666667
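To make the direction of the ranking concrete, here is a small sketch on a made-up vector of goals (not the workshop data). min_rank() gives rank 1 to the smallest value and gives tied values the same rank; wrapping the column in desc() flips the direction so the largest value gets rank 1.

```r
library(dplyr)

# made-up values, including a tie
goals <- c(10, 20, 20, 5)

rank_low_first  <- min_rank(goals)        # smallest value gets rank 1
rank_high_first <- min_rank(desc(goals))  # largest value gets rank 1
perc_rank       <- percent_rank(goals)    # min_rank rescaled to run from 0 to 1

rank_low_first   # 2 3 3 1
rank_high_first  # 3 1 1 4
perc_rank        # 0.333 0.667 0.667 0.000 (rounded)
```

So in the messi_career example, filtering for goal_rank > 10 picks out the highest-scoring seasons because the default direction is ascending.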
Another useful thing you can do is cumulative
calculations, such as the cumulative sum or mean of a variable. For
example, in our messi_career data it might be interesting to work out his
cumulative goals, and cumulative average appearances. We use the
cumsum() and cummean() functions for these
calculations.
note: cumulative calculations work very well with longitudinal data, like we have for Lionel Messi’s career goals and appearances
messi_career %>%
mutate(cumul_goals = cumsum(Goals),
mean_cumul_app = cummean(Appearances)) %>%
select(Goals, cumul_goals, Appearances, mean_cumul_app)
## Goals cumul_goals Appearances mean_cumul_app
## 1 1 1 9 9.00000
## 2 8 9 25 17.00000
## 3 17 26 36 23.33333
## 4 16 42 40 27.50000
## 5 38 80 51 32.20000
## 6 47 127 53 35.66667
## 7 53 180 55 38.42857
## 8 73 253 60 41.12500
## 9 60 313 50 42.11111
## 10 41 354 46 42.50000
## 11 58 412 57 43.81818
## 12 41 453 49 44.25000
## 13 54 507 52 44.84615
## 14 45 552 54 45.50000
## 15 51 603 50 45.80000
## 16 31 634 44 45.68750
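As a quick check of what these two functions do, the sketch below runs them on the first four Goals values from the table above. cumsum() comes from base R and cummean() from dplyr; the last line illustrates that the cumulative mean is the running total divided by the number of values seen so far.

```r
library(dplyr)

# first four seasons of Goals from the table above
goals <- c(1, 8, 17, 16)

running_total <- cumsum(goals)   # 1 9 26 42
running_mean  <- cummean(goals)  # 1.0 4.5 8.666667 10.5

# cumulative mean = running total / how many values so far
manual_mean <- cumsum(goals) / seq_along(goals)
all.equal(running_mean, manual_mean)
```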
Using your usa_early90_high data we just made in the last exercise:
1. Make a new column called duration_rank, using the min_rank() function on the duration column
2. Make a new column called perc_duration_rank, using the percent_rank() function on the duration column
3. Make a new column called avg_cumul_duration, using the cummean() function on duration.
# your code here
The transmute() function in dplyr works in a similar way
to mutate(), but it drops all columns except those
it has just made.
# use transmute
messi_career %>%
transmute(cumul_goals = cumsum(Goals),
mean_cumul_app = cummean(Appearances))
## cumul_goals mean_cumul_app
## 1 1 9.00000
## 2 9 17.00000
## 3 26 23.33333
## 4 42 27.50000
## 5 80 32.20000
## 6 127 35.66667
## 7 180 38.42857
## 8 253 41.12500
## 9 313 42.11111
## 10 354 42.50000
## 11 412 43.81818
## 12 453 44.25000
## 13 507 44.84615
## 14 552 45.50000
## 15 603 45.80000
## 16 634 45.68750
The behaviour of transmute can be helpful in certain situations, but if you really want to keep some columns, you can add them into the transmute statement. In the example below, I want to keep the Goals and Appearances columns for comparison with the cumulative calculations I’ve made.
# keep Goals and Appearances
messi_career %>%
transmute(cumul_goals = cumsum(Goals),
mean_cumul_app = cummean(Appearances),
Goals,
Appearances)
## cumul_goals mean_cumul_app Goals Appearances
## 1 1 9.00000 1 9
## 2 9 17.00000 8 25
## 3 26 23.33333 17 36
## 4 42 27.50000 16 40
## 5 80 32.20000 38 51
## 6 127 35.66667 47 53
## 7 180 38.42857 53 55
## 8 253 41.12500 73 60
## 9 313 42.11111 60 50
## 10 354 42.50000 41 46
## 11 412 43.81818 58 57
## 12 453 44.25000 41 49
## 13 507 44.84615 54 52
## 14 552 45.50000 45 54
## 15 603 45.80000 51 50
## 16 634 45.68750 31 44
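A side note, offered as a sketch and not as part of the workshop material: to the best of my knowledge, recent dplyr releases (1.1.0 onwards) mark transmute() as superseded and suggest mutate() with .keep = "none" in its place. The mini_career data frame is a made-up stand-in for messi_career.

```r
library(dplyr)

# hypothetical mini data frame, standing in for messi_career
mini_career <- data.frame(
  Appearances = c(9, 25, 36),
  Goals = c(1, 8, 17)
)

# transmute keeps only the column it creates
with_transmute <- mini_career %>%
  transmute(cumul_goals = cumsum(Goals))

# mutate with .keep = "none" also drops the columns it was not asked to make
with_keep_none <- mini_career %>%
  mutate(cumul_goals = cumsum(Goals), .keep = "none")

names(with_transmute)
names(with_keep_none)
```

Both calls still work, so transmute() remains fine for this workshop; the .keep argument is just worth knowing about when reading newer code.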
Let’s use transmute to look at the durations of the films in the movies_imdb data.
1. Pipe movies_imdb to transmute()
2. Within transmute() make a variable called duration_rank, and use the min_rank() function on duration
3. Using filter(), slice_max() or slice_min(), find out the top 4 and bottom 4 film durations
# your code here
Changing column names is a very useful part of data science. Sometimes you’ll get a dataset with column names that are not very meaningful, or far too long. There are a few methods for changing column names, with the easiest being the tidyverse solution.
The first step in changing column names is viewing what the names
are! Two functions in R exist for this: colnames() and
names(). They do the same thing…so I prefer
names() as it is less typing.
# view a dataset's column names
names(messi_career)
## [1] "Appearances" "Goals" "Season"
## [4] "Club" "Age" "champLeagueGoal"
## [7] "goal_ratio" "diff_avg_goal_ratio" "goal_rank"
## [10] "goal_perc_rank"
The non-tidyverse way of changing column names is to use the
names() function. If you are changing one column you use
indexing with [], and for multiple columns you use c().
# Make a data frame
df <- data.frame(
column1 = rep("Hello", 4),
column2 = sample(1:10, 4),
column3 = seq(1:4),
integer = 4:7,
factor = factor(c("dog", "cat", "cat", "dog"))
)
df
## column1 column2 column3 integer factor
## 1 Hello 7 1 4 dog
## 2 Hello 3 2 5 cat
## 3 Hello 5 3 6 cat
## 4 Hello 6 4 7 dog
# change multiple columns using names
names(df) <- c("string", "random", "sequence", "integer", "factor")
names(df)
## [1] "string" "random" "sequence" "integer" "factor"
# using names and number index
names(df)[1] <- "a_string"
names(df)
## [1] "a_string" "random" "sequence" "integer" "factor"
# using logic and names
names(df)[names(df) == "sequence"] <- "its_a_sequence"
names(df)
## [1] "a_string" "random" "its_a_sequence" "integer"
## [5] "factor"
The main issues with these techniques are: 1) it can get really messy if you need to rename lots of columns in a larger dataset; 2) when assigning a whole vector with names(df) <- c(...), I have to supply a name for every column, even those I am not changing, otherwise it doesn’t work; 3) the syntax is a bit messy, especially the last example.
The rename() function from dplyr allows for simple
changing of column names with no fuss, and solves these problems.
The syntax is the same as the mutate() function, where
we have the name of the column we want to make, then what column we are
changing:
data %>% rename(new_column_name = old_column_name).
# load dplyr
library(dplyr)
# Make a data frame
df <- data.frame(
column1 = rep("Hello", 4),
column2 = sample(1:10, 4),
column3 = seq(1:4),
integer = 4:7,
factor = factor(c("dog", "cat", "cat", "dog"))
)
names(df)
## [1] "column1" "column2" "column3" "integer" "factor"
# rename columns that need renaming
df_new_col <- df %>%
rename(string = column1,
random = column2,
sequence = column3)
df_new_col
## string random sequence integer factor
## 1 Hello 6 1 4 dog
## 2 Hello 10 2 5 cat
## 3 Hello 1 3 6 cat
## 4 Hello 5 4 7 dog
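When many columns need the same kind of change, dplyr also provides rename_with(), which applies a function to the selected column names. The sketch below is only an illustration on a made-up data frame: it upper-cases the columns whose names start with "column" and leaves the others alone.

```r
library(dplyr)

# made-up data frame for illustration
df_demo <- data.frame(
  column1 = rep("Hello", 4),
  column2 = 1:4,
  integer = 4:7
)

# apply toupper() to the names of the selected columns only
df_upper <- df_demo %>%
  rename_with(toupper, .cols = starts_with("column"))

names(df_upper)
```

rename() stays the clearer choice when each column needs its own hand-picked name; rename_with() is handier when one rule covers them all.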
Let’s have a practice renaming some columns in the movies_imdb dataset.
1. Use names(movies_imdb) to get the column names of your dataset. This is a nice way of finding the column names, making it easy to copy and paste the names should you need to
2. Using the rename() function from dplyr, change reviews_from_users to User_reviews and reviews_from_critics to Critic_reviews
3. Assign the result back to movies_imdb
4. Run names(movies_imdb) again to view the new column names
# your code here
Sometimes you have a dataset that has messy or ugly column names, which would take some time to tidy up manually. As usual with R, there is a package for that situation, which happens more often than you think!
First, we need to install the janitor library.
# run to install janitor
install.packages("janitor")
A simple example is below. We have a data frame with inconsistent
column names. We use the clean_names() function from
janitor to tidy up the column names.
The output shows the difference between default R behaviour and how janitor has cleaned the names. As you can see the janitor output is consistent and in “snake_case” format.
# load janitor
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# make an example data frame
messy_cols <- data.frame(
'messyCol *1' = seq(1:5),
'messy.col 2' = seq(1:5),
'MESSY.COL 3' = seq(1:5),
'messy.col (4)' = seq(1:5)
)
# compare default to janitor col names
# (tibble() replaces the deprecated data_frame())
tibble(default = names(messy_cols),
janitor = names(clean_names(messy_cols)))
## # A tibble: 4 × 2
## default janitor
## <chr> <chr>
## 1 messyCol..1 messy_col_1
## 2 messy.col.2 messy_col_2
## 3 MESSY.COL.3 messy_col_3
## 4 messy.col..4. messy_col_4
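The "default" column above comes from base R, not from janitor: data.frame() passes column names through make.names() unless told otherwise. The sketch below shows that step on its own, plus the check.names = FALSE argument that switches it off. The raw_cols data frame is made up for illustration.

```r
# the same messy names used in the example above
messy_names <- c("messyCol *1", "messy.col 2", "MESSY.COL 3", "messy.col (4)")

# base R replaces every character that is not valid in a name with a dot
default_names <- make.names(messy_names)
default_names

# check.names = FALSE keeps the names exactly as typed
raw_cols <- data.frame("messyCol *1" = 1:5, check.names = FALSE)
names(raw_cols)
```

Names kept with check.names = FALSE need backticks when you refer to them in code, which is one reason cleaning them is usually the more comfortable route.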
The janitor library is designed to be used with the tidyverse, so
when loading in data, we can pipe our loaded data straight into the
clean_names() function from janitor.
# pipe data to clean names
messy_cols <- data.frame(
'messyCol *1' = seq(1:5),
'messy.col 2' = seq(1:5),
'MESSY.COL 3' = seq(1:5),
'messy.col (4)' = seq(1:5)
) %>% clean_names()
# view col names
names(messy_cols)
## [1] "messy_col_1" "messy_col_2" "messy_col_3" "messy_col_4"
You can change the default style, or case, of
clean_names() from snake case to another if you need or
want to. See some examples below.
# lower camel case
data.frame(
'messyCol *1' = seq(1:5),
'messy.col 2' = seq(1:5),
'MESSY.COL 3' = seq(1:5),
'messy.col (4)' = seq(1:5)
) %>% clean_names(case = "lower_camel")
## messyCol1 messyCol2 messyCol3 messyCol4
## 1 1 1 1 1
## 2 2 2 2 2
## 3 3 3 3 3
## 4 4 4 4 4
## 5 5 5 5 5
# title case
# This is useful for plotting or tables
data.frame(
'messyCol *1' = seq(1:5),
'messy.col 2' = seq(1:5),
'MESSY.COL 3' = seq(1:5),
'messy.col (4)' = seq(1:5)
) %>% clean_names(case = "title")
## Messy Col 1 Messy Col 2 Messy Col 3 Messy Col 4
## 1 1 1 1 1
## 2 2 2 2 2
## 3 3 3 3 3
## 4 4 4 4 4
## 5 5 5 5 5
# all_caps case
data.frame(
'messyCol *1' = seq(1:5),
'messy.col 2' = seq(1:5),
'MESSY.COL 3' = seq(1:5),
'messy.col (4)' = seq(1:5)
) %>% clean_names(case = "all_caps")
## MESSY_COL_1 MESSY_COL_2 MESSY_COL_3 MESSY_COL_4
## 1 1 1 1 1
## 2 2 2 2 2
## 3 3 3 3 3
## 4 4 4 4 4
## 5 5 5 5 5
A full list of the available cases is on this page, under the case argument: https://rdrr.io/cran/snakecase/man/to_any_case.html
Finally, you can decide if you want the numbers (if you have any) to
be aligned in the left, right, or middle of the column name. By default
clean_names() puts numbers to the middle/right. To change
this behaviour we use the numerals argument and specify left as shown
below.
data.frame(
'messyCol *1' = seq(1:5),
'messy.col 2' = seq(1:5),
'MESSY.COL 3' = seq(1:5),
'messy.col (4)' = seq(1:5)
) %>%
clean_names(numerals = "left")
## messy_col1 messy_col2 messy_col3 messy_col4
## 1 1 1 1 1
## 2 2 2 2 2
## 3 3 3 3 3
## 4 4 4 4 4
## 5 5 5 5 5
As the movies_imdb data we are using already has cleaned names, we will load in another dataset as an example: the pokemon dataset we have used in previous workshops.
1. Load the janitor and readr libraries
2. Use read_csv() to load in the pokemon dataset from this link: "https://raw.githubusercontent.com/andrewmoles2/rTrainIntroduction/main/r-fundamentals-5/data/pokemonGen1.csv". Call your data pokemon
3. Use read_csv() to load in the same pokemon dataset from the link, but this time pipe to clean_names(). Call this dataset pokemon_cleaned
4. Repeat the previous step, but within the clean_names() function, change the case used. Call this dataset pokemon_cleaned2
5. Compare the column names using the data.frame() function. Make your first column default = names(pokemon), second column cleaned = names(pokemon_cleaned), and your last column cleaned2 = names(pokemon_cleaned2). Run the code to review the output
Different cases available can be found at this link: https://rdrr.io/cran/snakecase/man/to_any_case.html
# your code here
We would be grateful if you could take a minute before the end of the workshop so we can get your feedback!
The solutions will be available from a link at the end of the survey.
In this coding challenge we will try and put together what we have learned in this and previous workshops.
We will be using data from the pokemon games, making some subsets from that data. If you are curious about the data, have a look at the source here: https://pokemondb.net/pokedex/all.
pokemon
pokemon to a factor
min_rank() function on speed and hp to calculate the rankings
pokemon_500
pokemon_500 data to slice_max or slice_min functions to find the top 10 fastest/slowest pokemon, and the top 10 highest/lowest hp pokemon. For example, slow <- pokemon_500 %>% slice_min(speed_rank, n = 10)
%in% operator
pokemon_500 data you made to see which pokemon types have total statistics over 500. The colours represent each pokemon type (grass is green etc.). It won’t run if pokemon_500 has not been made or named differently.
# your code here
Bonus code (see part 12 of coding challenge)
# bonus - see a bar plot of your pokemon types
# make a colour palette of the pokemon types
colour <- c("#6a8b5a", "#414152", "#5a8bee",
"#f6e652","#ffd5bd", "#b40000",
"#ee8329","#6ab4e6", "#8b6283", "#20b49c",
"#c57341", "#e6e6f6", "#ffffff",
"#a483c5", "#f65273", "#e6d5ac",
"#bdcdc5", "#083962")
# view the colours
#scales::show_col(colour)
# plot in a bar plot
barplot(height = table(pokemon_500$type1),
col = colour,
horiz= TRUE, las= 1,
xlim = c(0, 20),
xlab = "Frequency",
main = "Frequency of Pokemon types\n with total greater than 500")
If you are wondering how the colouring works, we are using the factor
levels of the type1 column. If you type
levels(pokemon_500$type1) you’ll get a vector with the 18
different factor levels, with Bug being 1 and Dark being 2 and so on.
The first element in our colour vector therefore matches up with the
first level of the type1 factor, which is bug.
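The matching between factor levels and colours can be seen in a smaller sketch. The three types and the three-colour palette below are made up for illustration (they are not the full 18-type palette); the point is that factor levels are sorted alphabetically by default, and table() reports counts in that same order, so the first colour lines up with the first level.

```r
# made-up types, deliberately not in alphabetical order
types <- factor(c("Water", "Bug", "Dark", "Bug"))

# levels are sorted alphabetically, whatever order the data arrived in
levels(types)

# table() follows the level order, which is the order barplot() draws bars in
counts <- table(types)
names(counts)

# hypothetical palette: one colour per level, in level order
palette_demo <- c("#6a8b5a", "#414152", "#5a8bee")
colour_key <- setNames(palette_demo, levels(types))
colour_key
```

If the levels were ever re-ordered, the colour vector would need re-ordering to match, which is why naming the colours by level (as colour_key does) can be a safer habit.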